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Abstract 


'  Using  parallel  processors  to  reduce  the  execution  times  of  classical  circuit  simulation 
programs  like  SPICE  and  ASTAP  has  been  the  focus  of  much  current  research.  In  these 
efforts,  good  parallel  speed  increases  have  been  achieved  for  linearized  system 
construction,  but  it  has  been  difficult  to  get  good  parallel  speed  increases  for  sparse 
matrix  solution.  In  this  paper  we  examine  two  approaches  for  reducing  parallel  sparse 
matrix  solution  time;  the  first  based  on  pivot  ordering  algorithms  for  Gaussian 
elimination,  and  the  second  based  on  relaxation  algorithms.  In  the  section  on  Gaussian 
elimination  sparse  matrix  solution,  we  present  a  pivot  ordering  algorithm  which  increases 
the  parallelism  of  Gaussian  elimination  compared  to  the  commonly  used  Markowitz 
method.  The  performance  of  the  new  algorithm  is  compared  to  other  suggested 
ordering  algorithms  for  a  collection  of  circuit  examples.  The  minimum  number  of  parallel 
steps  for  the  solution  of  a  tridiagonal  matrix  is  derived,  and  it  is  shown  that  this  optimum 
is  nearly  achieved  by  the  ordering  heuristics  which  attempt  to  maximize  parallelism.  In 
the  section  on  relaxation,  we  present  an  optimality  result  about  Gauss-Jacobi  over 
Gauss-Seidel  relaxation  on  parallel  processors. 

_ _  iY 

DISTRIBUTION  STATEMENT  lf| 

AppzoT&d  for  public  n  loaoe,  l 
Dwtiibution  llnllmitud 


v  c'osv$te’~s 
Pesearch  Cen-s 
Poom  39-32  ■ 


Massachusetts 

Inst  tute 
o‘  Technology 


Carnbr'dge 

Massachusetts 

02139 


Telephone 
(617)  253-8138 


n  9 


■npnwnpT  nrviiMwwwm  v  wlh  r  rn  w  T*T*  **  t  W  %X  TX  H  fW  Iff  H"  W  WJ  Cff  WXWTWTH&  WfWW  WWW  WWMMWWl^  fll 


,\V 


& 


;  «&' 


<  V 


\f 


>v 


<V  . 

v  '<< 


*>  •■ .  -  ■i  •/  *':•.■•<•: 


‘"1 
-  -1 


ul::  , 


^  f'i-  ., 


flrl! 


Acknowledgements 


This  research  was  funded  in  part  by  Sandia  National  Laboratory  contract  number  02- 
8522,  Semiconductor  Research  Corporation  contract  number  86-12*109,  and  by  the 
Defense  Advanced  Research  Projects  Agency  under  contract  number  N00014-87-K- 
0825. 


Author  Information 


Smart:  Coordinated  Science  Laboratory,  University  of  Illinois  at  Urbana-Champaign, 
1101  West  Springfield  Avenue,  Urbana,  IL  61801,  (217)  333-4847;  White:  Department  of 
Electrical  Engineering  and  Computer  Science,  MIT,  Room  36-880,  Cambridge,  MA 
02139,  (617)  253-2543. 


Copyright®  1988  MIT.  Memos  in  this  series  are  for  use  inside  MIT  and  are  not 
considered  to  be  published  merely  by  virtue  of  appearing  in  this  series.  This  copy  is  for 
private  circulation  only  and  may  not  be  further  copied  or  distributed,  except  for 
government  purposes,  if  the  paper  acknowledges  U.  S.  Government  sponsorship. 
References  to  this  work  should  be  either  to  the  published  version,  if  any,  or  in  the  form 
"private  communication.”  For  information  about  the  ideas  expressed  herein,  contact  the 
author  directly.  For  information  about  this  series,  contact  Microsystems  Research 
Center,  Room  39-321,  MIT,  Cambridge,  MA  02139;  (617)  253-8138. 


Reducing  the  Parallel  Solution  Time  of 
Sparse  Circuit  Matrices  Using  Reordered 
Gaussian  Elimination  and  Relaxation 


David  Smart 

Coordinated  Science  Laboratory 
University  of  Illinois  at  Urbana-Champaign 

Jacob  White 

Dept,  of  Elec.  Eng.  and  Comp.  Sci.,  M.I.T. 


Abstract 

Using  parallel  processors  to  reduce  the  execution  times  of  classical  circuit  sim¬ 
ulation  programs  like  SPICE  and  ASTAP  has  been  the  focus  of  much  current 
research.  In  these  efforts,  good  parallel  speed  increases  have  been  achieved  for 
linearized  system  construction,  but  it  has  been  difficult  to  get  good  parallel 
speed  increases  for  sparse  matrix  solution.  In  this  paper  we  examine  two  ap¬ 
proaches  for  reducing  parallel  sparse  matrix  solution  time:  the  first  based  on 
pivot  ordering  algorithms  for  Gaussian  elimination,  and  the  second  based  on 
relaxation  algorithms.  In  the  section  on  Gaussian  elimination  sparse  matrix 
solution,  we  present  a  pivot  ordering  algorithm  which  increases  the  parallelism 
of  Gaussian  elimination  compared  to  the  commonly  used  Markowitz  method. 
The  performance  of  the  new  algorithm  is  compared  to  other  suggested  ordering 
algorithms  for  a  collection  of  circuit  examples.  The  minimum  number  of  par¬ 
allel  steps  for  the  solution  of  a  tridiagonal  matrix  is  derived,  and  it  is  shown 
that  this  optimum  is  nearly  achieved  by  the  ordering  heuristics  which  attempt 
to  maximize  parallelism.  In  the  section  on  relaxation,  we  present  an  optimality 
result  about  Gauss-Jacobi  over  Gauss-Seidel  relaxation  on  parallel  processors. 

1  Introduction 

Designers  of  high  performance  integrated  circuits  make  extensive  use  of  circuit 
simulation  programs  like  SPICE  and  ASTAP  [NAG,  WEE]  in  order  to  tune 
their  designs  before  fabrication.  These  circuit  simulation  programs  often  require 
hours  or  days  to  complete  a  single  simulation  because  they  use  computationally 
expensive,  but  very  reliable,  numerical  techniques.  Since  many  simulations  are 


performed  to  design  a  given  integrated  circuit,  slow  simulator  turn-around  time 
can  significantly  increase  overall  design  time.  For  this  reason,  using  parallel 
processors  to  reduce  the  execution  times  of  circuit  simulation  programs  has 
been  the  focus  of  much  current  research[COX,  JAC]. 

Programs  like  SPICE  and  ASTAP  use  implicit  multistep  integration  algo¬ 
rithms  to  convert  the  differential  equation  system  to  a  sequence  of  algebraic 
problems,  one  for  each  integration  timestep.  The  algebraic  problems  are  solved 
using  an  iterative  Newton  method,  each  step  of  which  involves  linearizing  the 
circuit  about  some  guessed  solution,  and  solving  the  generated  sparse  linear  sys¬ 
tem  Good  parallel  speed  increases  have  been  achieved  for  the  linearized  system 
construction,  but  not  for  the  sparse  linear  system  solution. 

In  this  paper  we  examine  two  approaches  for  reducing  parallel  sparse  matrix 
solution  time:  the  first,  presented  in  the  following  section,  based  on  modified 
pivot  ordering  algorithms  for  Gaussian  elimination,  and  the  second,  presented 
in  section  3,  based  on  relaxation  algorithms.  Finally,  in  Section  4,  we  present 
our  conclusions  and  acknowledgements. 

2  Reordering  to  Reduce  Parallel  Solution  Time 

In  this  section,  the  parallel  triangularization,  or  LU  factorization,  phase  of  sparse 
matrix  solution  is  investigated.  We  begin  in  the  next  section  by  describing  the 
LU  factorization  algorithm  and  our  computational  model.  We  then  present  a 
modification  of  the  commonly  used  Markowitz  pivot  ordering  algorithm[MAR] 
which  reduces  the  parallel  LU  factorization  time,  and  compare  it  with  several 
other  suggested  reordering  algorithms  for  a  collection  of  circuit  examples.  We 
then  derive  the  optimum  parallel  solution  for  a  tridiagonal  matrix,  and  show 
that  this  optimum  is  nearly  achieved  for  all  pivot  reordering  algorithms,  except 
Markowitz  and  some  closely  related  algorithms. 

2.1  Parallelism  in  Sparse  Factorization 

The  solution  of  Ax  =  b,  where  A  €  5?nxn  and  x,b  £  3?n,  by  Gaussian  elimina¬ 
tion  can  be  accomplished  in  three  steps:  Triangularization  or  LU  factorization, 
forward  elimination,  and  backward  substitution[GOL].  LU  factorization  is  the 
most  time-consuming  of  the  three,  and  we  will  concentrate  exclusively  on  it. 
The  LU  factorization  algorithm  is,  for  dense  matrices,  given  by  the  following 
nested  loop. 

LU  Factorization 
for  k  =  l  to  n  —  1  process  pivot  k 

for  i  =  k  +  l  to  n 

a,k  =  a,k/akk  d'V'de 
for  j  =  k  +  1  to  n 


For  a  given  pivot  k  in  LU  factorization,  all  the  divide  operations  can  be  done 
concurrently,  and  once  they  are  completed,  all  the  update  operations  can  be 
done  concurrently.  Therefore,  for  large  dense  matrices  the  number  of  concurrent 
divide  operations  is  of  order  n,  and  the  number  of  concurrent  update  operations 
is  of  order  n2,  at  least  during  the  early  stages  of  the  algorithm. 

Circuit  matrices,  which  are  very  sparse,  have  few  nonzero  entries  in  their  rows 
and  columns,  especially  in  the  early  stages  of  the  decomposition,  before  many 
fill-ins  are  generated.  Therefore,  concurrent  processing  of  the  divides  and  then 
the  updates  for  a  single  pivot  does  not  provide  much  parallelism.  In  order  to  get 
greater  degrees  of  parallelism,  different  pivots  must  be  processed  concurrently. 
The  extent  to  which  this  is  possible  depends  on  the  matrix  structure,  which,  in 
turn,  determines  the  dependencies  between  operations. 

In  order  to  investigate  the  parallelism  available  in  sparse  LU  decomposition 
we  use  a  task  graph  model[WIN].  It  is  assumed  that  each  divide  and  each  update 
operation  takes  one  unit  of  time,  and  each  such  operation  is  represented  in  the 
task  graph  as  one  vertex.  Dependencies  between  the  operations  (e.g.  that  the 
divide  a,*  =  a.jt/at*.  must  occur  before  the  update  atJ  =  a,y  -  a„tat;  )  are 
represented  as  directed  edges.  Note  that  such  a  graph  for  a  sparse  matrix  can 
only  be  constructed  once  the  matrix  fill-in  pattern  is  known,  and  therefore  the 
graph  topology  is  a  function  of  the  pivot  ordering  in  the  LU  factorization.  The 
ordering  of  update  operations  also  affects  the  graph  topology,  since  different 
pivots  can  contribute  terms  to  the  same  update  destination,  and  these  update 
operations  must  be  performed  serially. 

Based  on  this  task  graph  model,  and  the  Markowitz  pivoting  order,  it  has 
been  shown  that  the  degree  of  parallelism  for  solving  circuit  matrices  can  be 
as  high  as  10%  of  the  number  of  rows  in  the  matrix.  This  suggests  that  for 
large  circuits,  large  speedups  may  be  possible,  but  for  medium-sized  problems 
additional  parallelism  must  be  exploited.  Speedups  achieved  in  practice  are 
also  limited  by  the  number  of  available  processors  and  by  overhead  due  to  task 
management  and  communications[YAM .  COX]. 

2.2  Pivot  Ordering  for  Improved  Parallelism 

For  uniprocessor  applications  the  goal  of  a  pivot  ordering  algorithm  is  to  min¬ 
imize  the  total  number  of  matrix  operations,  and  this  is  realized  by  using  re¬ 
ordering  algorithms  that  minimize  fill-in  in  the  matrix.  The  commonly  used 
Markowitz  algorithm  produces  a  near-optimal  ordering  by,  at  each  step  of  the 
elimination,  choosing  as  the  next  pivot  a  diagonal  entry  of  the  uneliminated 
submatrix  which  has  the  lowest  Markowitz  count,  defined  as  the  product  of  the 
number  of  nonzero  column  entries  and  nonzero  row  entries. 

For  a  parallel  processor,  the  goal  of  an  ordering  algorithm  should  be  to  re¬ 
duce  the  time  to  complete  the  decomposition,  and  this  may  not  necessarily  be 
the  solution  with  the  fewest  number  of  operations.  One  approach  to  measuring 
the  quality  of  an  ordering  algorithm  for  a  parallel  processor  is  to  compute  the 


length  of  the  longest  directed  path  in  the  task  graph,  which  is  referred  to  as  the 
task  graph  depth.  This  depth  is  equal  to  the  minimum  number  of  steps  required 
to  compute  the  LU  decomposition  given  sufficient  processors,  and  ignoring  over¬ 
head  factors  such  as  data  communications  and  task  scheduling.  Several  ordering 
algorithms  [BET,  HUA,  ZHO]  have  been  proposed  which  attempt  to  minimize 
the  depth  of  the  task  graph  without  significantly  increasing  the  total  number 
of  matrix  operations.  These  algorithms  require  a  method  for  monitoring  the 
growth  of  the  task  graph  depth  while  choosing  pivots  to  be  eliminated. 

We  propose  another  ordering  algorithm  which  also  attempts  to  minimize  the 
task  graph  depth  and  keep  the  total  number  of  operations  from  increasing  much. 
The  basic  idea  is  that  at  a  step  in  the  elimination  process,  a  set  of  candidate 
diagonal  pivots  is  constructed  from  these  diagonal  pivots  with  low  Markowitz 
counts.  From  the  set  of  candidate  pivots,  a  large  independent  set  is  extracted, 
that  is.  a  set  for  which  if  two  pivots  i  and  j  are  in  the  set  a{j  and  aJt  are  both 
zero.  All  the  pivots  in  an  independent  set  can  be  processed  concurrently  with 
no  conflicts,  except  that  more  than  one  pivot  may  contribute  a  term  to  the  same 
update  destination.  The  algorithm  uses  an  integer  parameter  a  >  0  which  can 
be  tuned  empirically  for  best  performance. 

Large  Independent  Set  (LIS)  Reordering 

Repeat  until  the  elimination  is  completed 
S=  {  } 

mcount=  minimum  Markowitz  count  of  remaining  pivots 
For  d  =  mcount  to  mcount  +  a 

For  each  pivot  v  of  Markovcitz  count  d 

If  5  +  u  is  an  independent  set  then  S  =  S  +  v 
Eliminate  using  the  pivots  in  the  set  S ,  creating  fill-ins 

The  resulting  task  graph  depths  for  the  LIS  ordering  algorithm  and  several 
other  ordering  techniques  applied  to  a  collection  of  matrices  are  given  in  Ta¬ 
ble  1.  Each  column  corresponds  to  a  circuit  matrix,  except  P1000  and  P2047 
are  1000  node  and  2047  node  tridiagonal  matrices.  Note  that  the  minimum 
depth/minimum  degree[BET]  algorithm  frequently  produces  the  smallest  depth 
of  the  ordering  methods,  however  it  tends  to  significantly  increase  the  total 
number  of  operations,  making  it  less  desirable  when  only  a  limited  number  of 
processors  is  available  or  when  overhead  factors  are  taken  into  account.  The 
peculiar  structure  of  the  ALU  circuit  produces  unusual  results.  For  this  circuit 
the  Markowitz  and  minimum  degree/minimum  depth  orderings  produce  exces¬ 
sive  fill-ins  which  are  avoided  in  the  other  orderings.  For  the  other  circuits,  the 
orderings  which  try  to  minimize  the  task  graph  depths  result  in  depths  up  to 
about  509f  less  than  the  Markowitz  method. 


Table  1.  Task  graph  depths. 


PLA 

OPAMP 

RAM 

RAM2 

P1000 

ALU 

P2047 

m 

35 

32 

66 

129 

1998 

387 

4092 

a 

24 

26 

45 

98 

1000 

264 

2047 

b 

18 

28 

30 

125 

26 

25 

30 

zl 

26 

29 

49 

91 

1000 

52 

2047 

z2 

19 

28 

36 

84 

29 

27 

32 

w 

23 

26 

42 

79 

30 

30 

33 

1 

19 

28 

43 

91 

27 

27 

30 

w* 

19 

24 

39 

75 

28 

27 

31 

1* 

19 

28 

42 

75 

26 

26 

30 

0 

23 

26 

Key:  m=Markowitz.  a=min  degree/min  depth[BET],  b=min  depth/  min  de- 
gree[BET],  zl,z2=Zhou  stages  1  and  2[ZH0),  wsWing/Huan^  pivot  order- 
ing[HUA],  1=LIS  with  a=2,  *=optimum  update  ordering  [HUA]  applied  to  the 
pivot  ordering.  o=globallv  optimum  pivot  and  update  ordering  (when  it  can  be 
computed). 

2.3  The  Question  of  Optimality 

Since  the  heuristic  ordering  algorithms  are  not  guaranteed  to  find  the  globally 
minimum  depth  ordering,  we  are  faced  with  the  question  of  how  close  the  result¬ 
ing  depths  are  to  the  minimum  possible  depth.  For  a  circuit  of  any  significant 
size,  there  are  too  many  possible  orderings  to  try  them  all.  Random  searches  for 
optimum  orderings,  nested  dissection,  and  simulated  annealing  did  not  produce 
orderings  that  were  any  better  than  the  ones  found  by  the  heuristics  above. 

The  quality  of  the  orderings  produced  by  the  heuristics  can  be  measured  if 
analytic  techniques  can  be  used  to  find  meaningful  lower  bounds  on  the  task 
graph  depth  It  is  possible  to  prove  a  useful  result  for  tridiagonal  matrices[SMA] 
Tridtagonal  matrices  are  relatively  easy  to  analyze  because,  regardless  of  the 
pivot  order,  after  the  elimination  of  a  pivot  the  resulting  uneliminated  portion 
of  the  matrix  can  be  reorganized  into  a  tridiagonal  matrix  of  lower  order. 

Theorem  1  For  any  d  >  0  ,  the  largest  n  such  that  the  depth  of  the  task  graph 
associated  with  a  indiagonal  matrix  of  sue  n  will  be  less  than  or  equal  to  d  is 

given  by 

td/zj  d—2j  ,  .  v 

2  Lrf/  sj  + 1  _  i  +  £  E(i)-  (1) 

j  =  [d/3J  +  l  i'=0  '  ' 

The  minimum  possible  task  graph  depths  for  two  tridiagonal  matrices  were  com¬ 
puted  based  on  (1),  and  these  optimum  depths  are  given  in  Table  1.  The  small 
differences  between  the  optimum  results  and  the  results  for  the  LIS  algorithm 
are  due  to  the  fact  that  LIS  tries  to  choose  as  many  pivots  as  possible  at  each 
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stage  and  does  not  consider  conflicts  that  will  result  later  in  adding  terms  to 
common  update  locations.  The  poor  performance  of  the  Markowitz  and  mini¬ 
mum  degree-based  methods  for  tridiagonal  matrices  is  due  to  the  fact  that  there 
are  many  pivots  which  can  be  processed  in  parallel  but  are  not  selected  by  the 
algorithm  because  they  do  not  have  minimum  degree. 

3  Parallel  Sparse  Matrix  Solution  by  Relax¬ 
ation 

Another  approach  to  solving  Ax  =  b  is  to  solve  each  i,h  row  equation  for  i,, 
moving  the  other  unknowns  to  the  right  hand  side  by  replacing  them  with 
guessed  values.  The  values  for  the  unknowns  so  approximated  can  be  used  to 
improve  the  guessed  values  used  in  each  row  computation,  with  the  hope  of 
improving  the  approximation.  This  procedure  can  be  repeated,  until  the  ap¬ 
proximations  stop  improving  significantly.  Algorithms  of  this  form  are  referred 
to  as  relaxation  algorithms,  and  are  not  commonly  used  in  general  circuit  simu¬ 
lation  programs  because  the  approximations  don’t  always  approach  the  correct 
solution.  However,  when  applied  to  the  specific  problem  of  the  transient  analy¬ 
sis  of  MOS  integrated  circuits,  relaxation  algorithms  do  converge  reliably  (with 
some  ‘  tuning”).  In  addition,  relaxation  algorithms  are  more  easily  parallelized, 
and  this  has  renewed  interest  in  them[SAL,  WHI,  DUE]. 

When  constructing  a  relaxation  iterative  procedure,  one  has  two  choices 
about  how  to  update  the  unknowns.  Either  one  can  first  solve  all  the  row- 
equations  with  some  existing  set  of  values  for  the  unknowns,  and  then  replace 
all  the  unknowns  with  the  improved  approximations;  or  as  one  can  solve  a 
particular  row  and  immediately  replace  the  associated  unknown  before  solving 
the  next  row.  The  element  update  equation  for  the  former  approach,  referred 
to  as  Gauss-Jacobi  (GJ),  can  be  written  compactly  as 

xf+1  =  —  b,~  dijXj  -  dijx)  ,  (2) 

Q"  ;=i  ;=■>! 

and  the  latter  approach,  referred  to  as  Gauss-Seidel  (GS),  can  be  written  as 

xf+1  =  —  bi  -  a,->z*+1  -  53  -  (3) 

“  L  j=l  >=•+! 

where  k  is  the  iteration  index. 

When  relaxation  methods  are  applied  to  the  matrices  associated  with  the 
transient  simulation  of  MOS  circuits,  GS  converges  much  faster  than  GJ,  be¬ 
cause  the  GS  relaxation  can  be  ordered  to  follow  the  strong  directionality  of 
such  circuits.  In  general,  one  would  expect  GS  to  be  faster  than  GJ  as  each 


»  j 


row  solution  uses  more  recent  information,  and  this  conclusion  is  supported  by 
the  Stein-Rosenberg  theory[VAR].  Some  parallel  relaxation-based  MOS  circuit 
simulators  use  the  GS  method  in  an  attempt  to  take  advantage  of  its  faster  con¬ 
vergence  speed[SAL.  DUE,  WHI].  However,  experimental  evidence  indicates, 
that  when  sufficiently  many  processors  are  used,  GJ  results  in  faster  solutions 
due  to  its  higher  degree  of  parallelism[SM A2,  MAT,  WEB], 

In  the  next  few  sections  we  show  that  the  empirical  evidence  about  the  supe¬ 
riority  of  GJ  on  parallel  processors  is  supported  by  a  parallel  theory  analogous 
to  the  Stein-Rosenberg  result.  In  the  next  subsection  we  present  a  simplified 
parallel  computation  model  in  order  to  reexamine  the  comparison  between  GS 
and  GJ.  We  then  demonstrate  that  applied  to  sparse  matrices,  GS  relaxation  has 
substantial  parallelism.  Finally,  in  the  last  subsection,  we  present  an  optimality 
result  for  GJ  over  GS  on  parallel  processors. 

3.1  A  Simplified  Parallel  Computational  Model 

The  unknown  update  equations  for  GJ  and  GS,  Eqns.  (2)  and  (3),  involve  the 
same  computation:  in  each  case  the  ith  unknown  is  updated  by  summing  n  —  1 
products  and  a  constant.  Parallelism  can  be  exploited  in  this  summation,  but 
it  will  be  the  same  for  both  GS  and  GJ.  The  difference  between  GS  and  GJ 
that  effects  how  much  total  parallelism  can  be  exploited  is  that  in  the  case  of 
GJ  all  the  unknowns  can  be  updated  simultaneously,  and  in  the  case  of  GS.  if 
A  is  full,  only  one  unknown  can  be  updated  at  a  time. 

In  order  to  more  easily  examine  the  difference  between  the  two  methods, 
we  will  treat  (2)  and  (3)  as  atomic  operations  which  can  be  computed  in  one 
processor  step.  In  this  notation,  and  assuming  sufficiently  many  processors,  one 
iteration  of  GJ  takes  one  step,  as  all  the  unknown  updates  can  be  performed 
simultaneously,  and  GS  takes  n  steps  if  A  is  full,  as  the  i,h  unknown  must  be 
updated  before  the  i  4-  1  update  equation  can  be  completed. 

3.2  Exploiting  Sparsity  in  Parallel  GS 

It  is  possible  to  exploit  the  sparsity  of  circuit  matrices  to  increase  the  parallelism 
of  GS  and  reduce  the  number  of  steps  needed  to  complete  an  iteration  to  well 
below  n  For  example,  if  a<+i,<  =  0,  then  x^  can  be  updated  simultaneously  with 
x,  +  i .  It  is  easy  to  calculate  exactly  how  many  steps  it  will  take  to  compute  a  GS 
iteration  on  a  sparse  matrix  by  examining  the  following  graph  constructed  from 
A.  Let  the  graph  have  n  nodes,  labeled  1  through  n.  For  each  a;,  ^  0,  i  <  j, 
place  a  directed  arc  from  node  i  to  node  j  in  the  graph.  The  result  is  an  acyclic 
directed  graph,  and  the  depth  of  this  graph  is  precisely  equal  to  the  number  of 
steps  required  to  complete  one  iteration  of  GS  on  a  parallel  processor. 

It  is  also  possible  to  exploit  more  parallelism  in  the  GS  algorithm  by  begin¬ 
ning  the  lc  +  l,h  iteration  before  completing  the  kth.  For  example,  ifaij  through 
oltt  are  all  zero,  then  one  can  compute  x*  +  1  without  waiting  for  x*  through  x* 


to  be  computed  first.  Combining  this  technique  of  overlapping  iterations  with 
simultaneously  updating  independent  z,  ’s  ,  aw  mentioned  above,  yields  a  paral¬ 
lel  GS  algorithm  which  is  best  characterized  by  the  average  number  of  processor 
steps  between  GS  iteration  completions.  We  will  refer  to  this  number  of  steps 
as  r.  It  should  be  noted  that  r  depends  only  on  the  nonzero  structure  of  A 
and  for  general  sparse  matrices  r  can  be  much  less  than  n.  For  example,  if  A 
is  tridiagonal,  r  =  2.  With  this  in  mind,  the  winner  in  a  comparison  between 
GJ  and  GS  on  parallel  processors  is  less  clear,  and  is  the  subject  of  the  next 
section. 

3.3  The  Optimality  of  Parallel  GJ  over  GS 

In  order  to  compare  the  G  J  and  GS  procedures,  we  will  consider  their  asymptotic 
convergence  properties.  In  particular,  let  A  —  L  -+■  D  +  U  where  L,  D ,  and  U 
are  strictly  lower  triangular,  diagonal,  and  strictly  upper  triangular  matrices, 
respectively.  Define  the  GJ  and  GS  iteration  matrices  Mgj  =  —D~1(L+U)  and 
Mg s  =  — ( L  +  D)~1U ■  It  is  well  known  that  the  asymptotic  rates  of  convergence 
of  GJ  and  GS  are  related  to  p(Mgj)  and  p(Mgs )  respectively[VAR],  where  p 
denotes  spectral  radius.  Since  m  iterations  of  parallel  GJ  require  m  steps  and 
in  iterations  of  parallel  GS  require  r  m  steps,  in  the  limit  as  m  — »  oo.  it  follows 
that  parallel  G  J  is  asymptotically  faster  than  parallel  GS  if  p{MgjY  is  less  than 
p{Mgs)-  For  a  large  class  of  matrices  it  can  be  proved  that  this  is  the  case, 
and  therefore  parallel  GJ  is  asymptotically  faster  than  parallel  GS.  The 
result  is  precisely  stated  in  the  following  theorem: 

Theorem  2  If  the  elements  of  Mgj  are  nonnegative,  and  p(Mgj)  <  1,  then 

P(MgjY  <  p(Mgs)- 

Examining  nonnegative  convergent  iteration  matrices  follows  the  Stein- Rosenberg 
approach  to  the  GS/GJ  comparisonfVAR].  It  is  also  true  that  the  matrices  as¬ 
sociated  with  the  transient  simulation  of  MOS  circuits,  if  the  discretization 
timestep  is  small[WHI],  satisfy  the  conditions  of  the  theorem,  which  suggests 
the  comparison  plays  a  direct  role  in  practice. 

The  proof  of  Theorem  2  utilizes  the  following  lemma[GAN]  and  definition: 

Lemma  1  Perron-Frobemus:  If  matrix  M  €  SJ”*"  is  nonnegative,  then  it  has  a 
nonnegatne  real  eigenvalue  equal  to  its  spectral  radius  and  a  nonnegative  eigen¬ 
vector  associated  with  that  eigenvalue. 

Definition  1  Define  iPGJl  ,xPGSl  €  9Rn  to  be  the  vectors  of  most  recently  up¬ 
dated  elements  produced  by  I  processor  steps  of  parallel  GJ  and  GS  algorithms 
respectively 


Since  a  GJ  iteration  finishes  in  one  processor  step,  xPGJl  =  xl ,  the  Ith  iterate 
of  a  GJ  iteration,  and  therefore 


xPGJi+ 1  _ 


£+  E 


;  =  n 
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a'i  -PGJi 

ST' 
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In  general.  XPGSI  never  corresponds  to  any  xk  iterate  of  GS,  because  overlap¬ 
ping  of  the  iterations  implies  that  the  elements  of  x  most  recently  updated  can 
correspond  to  different  iterations.  And  because  of  data  dependencies  inherent 
in  the  GS  algorithm,  many  of  the  elements  of  xPGSl  are  the  same  as  those  of 
rPGSi+ 1  Therefore,  each  GS  update  will  be  given  by  either 

xpas,+ 1  =  xPGSI  (5) 


or 


J.PGSI  + 


1  _  j  y'  Gij  xPGSm(j) 
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(6) 


where  m(j )  £  {0 . /}.  Note  that  m(j)  is  used  to  indicate  that  in  order  to 

follow  the  GS  update  formula,  it  may  be  necessary  to  pick  out  elements  from 
several  different,  but  earlier,  xPGSmU)  vectors. 

In  order  to  complete  the  proof,  we  now  consider  solving  Ax  =  0  by  relaxation, 
where  A  is  such  that  Mgj  exists  and  is  nonnegative  and  p(Mgj)  <  1  Let  the 
initial  guess  x°  =  XPGS0  =  XPGJ0  be  a  nonnegative  eigenvector  associated  with 
a  nonnegative  eigenvalue  of  Mgj  equal  to  p(Mcj)-  Since  Mgj  is  nonnegative, 
a,;/aM  S  0  for  all  1  ^  j  Consequently,  since  x°  is  nonnegative  and  6  =  0,  each 
term  in  (4)  and  (6)  is  nonnegative  for  all  «,/.  Also,  since  x°  is  an  eigenvector 
associated  with  an  eigenvalue  equal  to  p(Mgj)  <  h  xPGJI  =  p(Mgj)'x°  and 
therefore  iPGJI  decays  monotonically  with  l  for  all  i. 

Suppose  p{MgjY  >  p(Mgs)<  then  in  the  limit  as  /  — ♦  oo  the  nonnegative 
vector  xPGSI  will  be  less  than  xPGJl .  We  will  show  by  induction  that  this  can 
not  happen,  and  this  will  complete  the  proof.  First,  assume  xPGJI  <  xPGSm 

for  all  i  and  all  m  £  {0 . /}.  This  assumption  clearly  holds  for  l  =  0,  thus 

forming  the  basis  of  the  induction.  In  those  cases  where  (5)  applies,  xPGSI+l  = 
xPGS‘  >  xPGJI ,  and  xPG9l+1  >  XPGJI+1  because  xPGJI  is  monotone  decreasing 
in  /.  In  those  cases  where  (6)  applies,  each  term  of  the  summation  in  (4) 
is  less  than  or  equal  to  the  corresponding  term  in  (6)  and  all  the  terms  are 
nonnegative.  Hence  xPGSl+1  >  XPGJI+1 .  Since  xPGJl  decreases  monotonically, 
xPGJI+ 1  <  xPGJm  <  xPGSm  for  all  m  <  l.  Consequently,  XPGJ,+  1  <  xPGs'm 
for  all  t  and  all  rn  £  {0 . /  +  1}.  thus  completing  the  induction  and  the  proof 


4  Conclusions  and  Acknowledgements 

Matrix  reordering  heuristics  for  parallel  processing  have  been  shown  to  be  effec¬ 
tive  in  reducing  the  depth  of  the  LU  factorization  task  graph.  However,  for  most 


of  the  circuit  examples  investigated  the  improvement  in  depth  was  less  than  50% 
,  and  the  benefit  of  this  improvement  is  only  realizable  on  a  large  number  of 
processors.  One  reason  for  this  is  that  the  Markowitz  algorithm  which  was  used 
as  the  basis  for  comparison  tends  to  produce  small  task  graph  depths  simply 
by  keeping  the  total  number  of  operations  small.  For  certain  matrix  structures, 
the  heuristics  which  attempt  to  maximize  parallelism  produce  vastly  superior 
results,  such  as  for  tridiagonal  matrices.  In  the  examples  that  were  considered, 
the  LIS  algorithm  consistently  produced  good  orderings. 

In  the  examination  of  relaxation  methods  we  proved  a  result  that  indicates 
on  a  parallel  processor  G  J  relaxation  is  almost  certainly  superior  to  GS,  almost 
regardless  of  the  numerical  character  of  the  problem.  The  result,  we  think,  is 
also  interesting  because  it  associates  the  spectral  radius  of  a  matrix  to  some  of 
its  graphical,  rather  than  numerical,  properties.  For  example,  Theorem  2  leads 
to  the  somewhat  surprising  conclusion  that  if  a  tridiagonal  matrix  satisfies  the 
conditions  of  the  theorem,  then  the  the  spectral  radius  of  its  associated  GS 
iteration  matrix  is  no  less  than  half  the  spectral  radius  of  its  associated  GJ 
iteration  matrix. 

The  authors  would  like  to  acknowledge  the  many  valuable  discussions  with  A 
Sangiovanni-Vincentelli.  Tim  Trick.  Don  Webber,  Res  Saleh.  Vish  Visvanathan. 
and  Yasant  Rao.  In  addition,  we  would  like  to  mention  the  paper  by  Chazan  and 
Miranker[CHA].  from  which  some  of  the  ideas  for  the  analysis  were  derived.  This 
research  was  funded  in  part  by  Sandia  National  Laboratory  contract  02-8522. 
Semiconductor  Research  Corporation  contract  86-12-109.  and  by  the  Defense 
Advanced  Research  Projects  Agency  under  contract  N00014-87-K-0825. 
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